{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "xh2G62DR0GEQ"
   },
   "source": [
    "# Ungraded lab: Algorithmic Dimensionality Reduction \n",
    "------------------------\n",
    " \n",
    "Welcome, during this ungraded lab you are going to perform several algorithms that aim to reduce the dimensionality of data. This topic is very important because it is not uncommon that reduced models outperform the ones trained on the raw data because noise and redundant information are present in most datasets. This will also allow your models to train and make predictions faster, which might be really important depending on the problem you are working on. In particular you will:\n",
    "\n",
    "\n",
    "1. Use Principal Component Analysis (**PCA**) to reduce the dimensionality of a dataset that classifies celestial bodies.\n",
    "2. Use Single Value Decomposition (**SVD**) to create low level representations of images of handwritten digits.\n",
    "3. Use Non-negative Matrix Factorization (**NMF**) to segment text into topics.\n",
    "\n",
    "Let's get started!\n",
    " \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "N7DtysD30vLM"
   },
   "outputs": [],
   "source": [
    "# General use imports\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_F3XbdaatzRM"
   },
   "source": [
    "## Principal Components Analysis - PCA\n",
    "\n",
    "This is an unsupervised algorithm that creates linear combinations of the original features. PCA is a widely used technique for dimension reduction since it is fast and easy to implement. PCA aims to keep as much variance as possible from the original data in a lower dimensional space. It finds the best axis to project the data so that the variance of the projections is maximized.\n",
    "\n",
    "In the lecture you saw PCA applied to the Iris dataset. This dataset has been used extensively to showcase PCA so here you are going to do something different. You are going to use the [HTRU_2](https://archive.ics.uci.edu/ml/datasets/HTRU2) dataset which describes several celestial objects and the idea is to be able to classify if an object is a pulsar star or not.\n",
    "\n",
    "Begin by downloading the dataset:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "-tF29GljKdR6"
   },
   "outputs": [],
   "source": [
    "# Download zip file\n",
    "!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00372/HTRU2.zip\n",
    "\n",
    "# Unzip it\n",
    "!unzip HTRU2.zip"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sm6ONHW1pB58"
   },
   "source": [
    "Load the data into a dataframe for easier inspection:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "iRje5zNoJFnN"
   },
   "outputs": [],
   "source": [
    "# Load data into a pandas dataframe\n",
    "data = pd.read_csv(\"HTRU_2.csv\", names=['mean_ip', 'sd_ip', 'ec_ip', \n",
    "                                        'sw_ip', 'mean_dm', 'sd_dm', \n",
    "                                        'ec_dm', 'sw_dm', 'pulsar'])\n",
    "\n",
    "# Take a look at the data\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1KAMRd_rPrMG"
   },
   "source": [
    "This dataset has 8 numerical features (the \"pulsar\" column is the label). Now you are going to perform PCA reduce this 8th-dimensional input space to a lower dimensional one.\n",
    "\n",
    "But first, scale the data. If you do an exploratory analysis of the data you will see that this dataset has a lot of outliers. Because of this you are going to use a [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html), which scales features using statistics that are robust to outliers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "iLYmitEkJlAj"
   },
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import RobustScaler\n",
    "\n",
    "# Split features from labels\n",
    "features = data[[col for col in data.columns if col != \"pulsar\"]]\n",
    "labels = data[\"pulsar\"]\n",
    "\n",
    "# Scale data\n",
    "robust_data = RobustScaler().fit_transform(features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "h3jjKItOYQTa"
   },
   "source": [
    "Now perform PCA using sklearn. In this first iteration you are going to create a principal component for each one of the features so there is no dimensionality reduction:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "zHBa-7I5J8h-"
   },
   "outputs": [],
   "source": [
    "from sklearn.decomposition import PCA\n",
    "\n",
    "# Instantiate PCA without specifying number of components\n",
    "pca_all = PCA()\n",
    "\n",
    "# Fit to scaled data\n",
    "pca_all.fit(robust_data)\n",
    "\n",
    "# Save cumulative explained variance\n",
    "cum_var = (np.cumsum(pca_all.explained_variance_ratio_))\n",
    "n_comp = [i for i in range(1, pca_all.n_components_ + 1)]\n",
    "\n",
    "# Plot cumulative variance\n",
    "ax = sns.pointplot(x=n_comp, y=cum_var)\n",
    "ax.set(xlabel='number of  principal components', ylabel='cumulative explained variance')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "HTnErJpNYiv6"
   },
   "source": [
    "Wow! With just 3 components almost all of the variance of the original data is explained! This makes you think that there were some highly correlated features in the original data.\n",
    "\n",
    "Let's plot the first 3 principal components:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "bCj4yZkeK9-c"
   },
   "outputs": [],
   "source": [
    "from mpl_toolkits.mplot3d import Axes3D\n",
    "\n",
    "# Instantiate PCA with 3 components\n",
    "pca_3 = PCA(3)\n",
    "\n",
    "# Fit to scaled data\n",
    "pca_3.fit(robust_data)\n",
    "\n",
    "# Transform scaled data\n",
    "data_3pc = pca_3.transform(robust_data)\n",
    "\n",
    "# Render the 3D plot\n",
    "fig = plt.figure(figsize=(15,15))\n",
    "ax = fig.add_subplot(111, projection='3d')\n",
    "\n",
    "\n",
    "ax.scatter(data_3pc[:, 0], data_3pc[:, 1], data_3pc[:, 2], c=labels,\n",
    "           cmap=plt.cm.Set1, edgecolor='k', s=25, label=data['pulsar'])\n",
    "\n",
    "ax.legend([\"non-pulsars\"], fontsize=\"large\")\n",
    "\n",
    "ax.set_title(\"First three PCA directions\")\n",
    "ax.set_xlabel(\"1st principal component\")\n",
    "ax.w_xaxis.set_ticklabels([])\n",
    "ax.set_ylabel(\"2nd principal component\")\n",
    "ax.w_yaxis.set_ticklabels([])\n",
    "ax.set_zlabel(\"3rd principal component\")\n",
    "ax.w_zaxis.set_ticklabels([])\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ywNacdDYZ4Ku"
   },
   "source": [
    "It is possible to visualize a plane that would be able to separate both classes since non-pulsars tend to group on the edge of this surface while pulsars are mostly located on the inner side of the surface.\n",
    "\n",
    "In this case it is reasonable to think that the dimension can be reduced even more since with 2 principal components more than 95% of the variance of the original data is explained. Now let's plot just the first two principal components:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "GKsZa_rjKEFU"
   },
   "outputs": [],
   "source": [
    "# Instantiate PCA with 2 components\n",
    "pca_2 = PCA(2)\n",
    "\n",
    "# Fit and transform scaled data\n",
    "pca_2.fit(robust_data)\n",
    "data_2pc = pca_2.transform(robust_data)\n",
    "\n",
    "# Render the 2D plot\n",
    "ax = sns.scatterplot(x=data_2pc[:,0], \n",
    "                     y=data_2pc[:,1], \n",
    "                     hue=labels,\n",
    "                     palette=sns.color_palette(\"muted\", n_colors=2))\n",
    "\n",
    "ax.set(xlabel='1st principal component', ylabel='2nd principal component', title='First two PCA directions')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2dWYFdtQ-7KX"
   },
   "source": [
    "Even in 2D the 2 classes look linearly separable (not perfectly, of course) but this is quite remarkable considering that the initial space was 8th dimensional.\n",
    "\n",
    "Using PCA you've successfully reduced the dimensionality from 8 to 2 while maintaining a lot of the variance of the original data!\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "MbjFEgeU2C2U"
   },
   "source": [
    "## Singular Value Decomposition - SVD\n",
    "\n",
    "SVD is one way to decompose matrices. Remember that matrices can be seen as linear transformations in space. PCA relies on eigendecomposition, which can only be done for square matrices. You might wonder why the first example worked with PCA if the data had far more observations than features. The reason is that when performing PCA you end up using the matrix product $X^{t}X$ which is a square matrix.\n",
    "\n",
    "However you don’t always have square matrices, and sometimes you have really sparse matrices.\n",
    "\n",
    "To decompose these types of matrices, which can’t be decomposed with eigendecomposition, you can use techniques such as Singular Value Decomposition. SVD decomposes the original dataset into its constituents, resulting in a reduction of dimensionality. It is used to remove redundant features from the dataset.\n",
    "\n",
    "To check SVD you are going to use the [digits dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_digits_last_image.html), which is made up of 1797 8x8 images of handwritten digits:\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "LvtQsotB3taQ"
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_digits\n",
    "\n",
    "# Load the digits dataset\n",
    "digits = load_digits()\n",
    "\n",
    "# Plot first digit\n",
    "image = digits.data[0].reshape((8, 8))\n",
    "plt.matshow(image, cmap = 'gray')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "MBM3Z_fxlVmF"
   },
   "source": [
    "\n",
    "Let's continue by normalizing the data and checking its dimensions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Nn9KmwoXiYTl"
   },
   "outputs": [],
   "source": [
    "# Save data into X variable\n",
    "X = digits.data\n",
    "\n",
    "# Normalize pixel values\n",
    "X = X/255\n",
    "\n",
    "# Print shapes of dataset and data points\n",
    "print(f\"Digits data has shape {X.shape}\\n\")\n",
    "print(f\"Each data point has shape {X[0].shape}\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "y2lyZgeXsMym"
   },
   "source": [
    "Plot the first digit to check how normalization affects the images:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "yavX0PsNppOp"
   },
   "outputs": [],
   "source": [
    "image = X[0].reshape((8, 8))\n",
    "plt.matshow(image, cmap = 'gray')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "hnWLcJ9-nej2"
   },
   "source": [
    "The image should be identical to the one without normalization. This is because the relative brightness of each pixel with the others is maintained. Normalization is done as a preprocessing step when feeding the data into a Neural Network. Here it is done since it is a step that is usually always done when working with image data.\n",
    "\n",
    "Now perform SVD on the data and plot the cumulative amount of explained variance for every number of components. Note that the [TruncatedSVD](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) needs a number of components strictly lower to the number of features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "qdyxjZtJo5aJ"
   },
   "outputs": [],
   "source": [
    "from sklearn.decomposition import TruncatedSVD\n",
    "\n",
    "# Instantiate Truncated SVD with (original dimension - 1) components\n",
    "org_dim = X.shape[1]\n",
    "tsvd = TruncatedSVD(org_dim - 1)\n",
    "tsvd.fit(X)\n",
    "\n",
    "# Save cumulative explained variance\n",
    "cum_var = (np.cumsum(tsvd.explained_variance_ratio_))\n",
    "n_comp = [i for i in range(1, org_dim)]\n",
    "\n",
    "# Plot cumulative variance\n",
    "ax = sns.scatterplot(x=n_comp, y=cum_var)\n",
    "ax.set(xlabel='number of  components', ylabel='cumulative explained variance')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "J0iqv-GNWSq4"
   },
   "source": [
    "Looking at the plot it can be seen that with only 5 components near the 50% of the variance of the original data is explained.\n",
    "\n",
    "Let's double check this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "A46cvALvWha3"
   },
   "outputs": [],
   "source": [
    "print(f\"Explained variance with 5 components: {float(cum_var[4:5])*100:.2f}%\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "kJpLDxE7tX_s"
   },
   "source": [
    "It is not a lot but let's check what you get when performing SVD with only 5 components:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "3m5v4LRO2ELC"
   },
   "outputs": [],
   "source": [
    "# Instantiate a Truncated SVD with 5 components\n",
    "tsvd = TruncatedSVD(n_components=5)\n",
    "\n",
    "# Get the transformed data\n",
    "X_tsvd = tsvd.fit_transform(X)\n",
    "\n",
    "# Print shapes of dataset and data points\n",
    "print(f\"Original data points have shape {X[0].shape}\\n\")\n",
    "print(f\"Transformed data points have shape {X_tsvd[0].shape}\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tCFZpcFTtoKj"
   },
   "source": [
    "By doing this you are now representing each digit using 5 dimensions instead of the original 64! Isn't that amazing?\n",
    "\n",
    "Now check how this looks like visually:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "2uikZXMc3Ot0"
   },
   "outputs": [],
   "source": [
    "image_reduced_5 = tsvd.inverse_transform(X_tsvd[0].reshape(1, -1))\n",
    "image_reduced_5 = image_reduced_5.reshape((8, 8))\n",
    "plt.matshow(image_reduced_5, cmap = 'gray')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "nAxl1kCV6hC7"
   },
   "source": [
    "It looks blurry but you can still tell this is a zero.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Nz9LV7wJ6yRV"
   },
   "source": [
    "### Using more components\n",
    "\n",
    "Let’s try again, only, this time, we use half the number of features in the original data. \n",
    "\n",
    "But first define a function that performs this process for any number of components:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "vIkuX6vwqvMA"
   },
   "outputs": [],
   "source": [
    "def image_given_components(n_components, verbose=True):\n",
    "  tsvd = TruncatedSVD(n_components=n_components)\n",
    "  X_tsvd = tsvd.fit_transform(X)\n",
    "  if verbose:\n",
    "    print(f\"Explained variance with {n_components} components: {float(tsvd.explained_variance_ratio_.sum())*100:.2f}%\\n\")\n",
    "  image = tsvd.inverse_transform(X_tsvd[0].reshape(1, -1))\n",
    "  image = image.reshape((8, 8))\n",
    "  return image\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "McjDeh31wTbC"
   },
   "source": [
    "Use the function to generate the image that use 32 components:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "KC8nnFAEYN5D"
   },
   "outputs": [],
   "source": [
    "image_reduced_32 = image_given_components(32)\n",
    "plt.matshow(image_reduced_32, cmap = 'gray')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tJHJrPYSvwNr"
   },
   "source": [
    "Wow! This image looks very similar to the original one (no wonder since more than 95% of the original variance is explained) but the dimensionality of the representations have been cut in half!\n",
    "\n",
    "To better grasp how the images look like depending on the dimensionality of the representations, the next cell plots them side by side (the last figure has a parameter that you can tweak):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "GgZhbQeRY2vV"
   },
   "outputs": [],
   "source": [
    "fig = plt.figure()\n",
    "\n",
    "# Original image\n",
    "ax1 = fig.add_subplot(1,4,1)\n",
    "ax1.matshow(image, cmap = 'gray')\n",
    "ax1.title.set_text('Original')\n",
    "ax1.axis('off') \n",
    "\n",
    "# Using 32 components\n",
    "ax2 = fig.add_subplot(1,4,2)\n",
    "ax2.matshow(image_reduced_32, cmap = 'gray')\n",
    "ax2.title.set_text('32 components')\n",
    "ax2.axis('off') \n",
    "\n",
    "# Using 5 components\n",
    "ax3 = fig.add_subplot(1,4,3)\n",
    "ax3.matshow(image_reduced_5, cmap = 'gray')\n",
    "ax3.title.set_text('5 components')\n",
    "ax3.axis('off') \n",
    "\n",
    "# Using 1 components\n",
    "ax4 = fig.add_subplot(1,4,4)\n",
    "ax4.matshow(image_given_components(1), cmap = 'gray') # Change this parameter to see other representations\n",
    "ax4.title.set_text('1 component')\n",
    "ax4.axis('off')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Au1wwOy2wyAV"
   },
   "source": [
    "Notice how with 1 component it is not possible to determine that the image is a zero. What is the minimun number of components that are needed for this? Be sure to try out different values and see what you get!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "MPsBWkL07Srw"
   },
   "source": [
    "## Non-negative Matrix Factorization - NMF\n",
    "\n",
    "NMF expresses samples as combinations of interpretable parts. For example, it represents documents as combinations of topics, and images in terms of commonly occurring visual patterns. NMF, like PCA, is a dimensionality reduction technique. In contrast to PCA, however, NMF models are interpretable. This means NMF models are easier to understand and much easier for us to explain to others. NMF can't be applied to every dataset, however. It requires the sample features be non-negative, so greater than or equal to 0. \n",
    "\n",
    "To test NMF you will use the [20newsgroups dataset](https://scikit-learn.org/stable/datasets/real_world.html#the-20-newsgroups-text-dataset) which comprises around 12000 newsgroups posts on 20 topics. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "I9btFVismWmz"
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.decomposition import NMF\n",
    "from sklearn.datasets import fetch_20newsgroups\n",
    "\n",
    "# Download data\n",
    "data = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))\n",
    "\n",
    "# Get the actual text data from the sklearn Bunch\n",
    "data = data.get(\"data\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "mimj0peyrSzn"
   },
   "source": [
    "At this point you have the data in a list format. Let's check it out:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "-6toK2WNq8lX"
   },
   "outputs": [],
   "source": [
    "print(f\"Data has {len(data)} elements.\\n\")\n",
    "print(f\"First 2 elements: \\n\")\n",
    "for n, d in enumerate(data[:2], start=1):\n",
    "  print(\"======\"*10)\n",
    "  print(f\"Element number {n}:\\n\\n{d}\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "FWyR9Muzsm6r"
   },
   "source": [
    "Notice that you only have the actual text without information of the topic it belongs to (labels). \n",
    "\n",
    "Now you need to represent the text as vectors, for this you will use a [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer) with `max_features` set to 500. This will be the original dimensionality of the data (which you will reduce via NMF)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "8k2cCmaftiPc"
   },
   "outputs": [],
   "source": [
    "# Instantiate vectorizer setting dimensionality of data\n",
    "# The stop_words param refer to words (in english) that don't add much value to the content of the document and must be ommited\n",
    "vectorizer = TfidfVectorizer(max_features=500, stop_words='english')\n",
    "\n",
    "# Vectorize original data\n",
    "vect_data = vectorizer.fit_transform(data)\n",
    "\n",
    "\n",
    "# Print dimensionality\n",
    "print(f\"Data has shape {vect_data.shape} after vectorization.\")\n",
    "print(f\"Each data point has shape {vect_data[0].shape} after vectorization.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "wheFr1G8udU2"
   },
   "source": [
    "Every one of the texts in the original data is represented as a 1x500 vector.\n",
    "\n",
    "Now use NMF to reduce this dimensionality:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Zc10PMmzmcp3"
   },
   "outputs": [],
   "source": [
    "# Desired number of components\n",
    "n_comp = 5\n",
    "\n",
    "# Instantiate NMF with the desired number of components\n",
    "nmf = NMF(n_components=n_comp, random_state=42)\n",
    "\n",
    "# Apply NMF to the vectorized data\n",
    "nmf.fit(vect_data)\n",
    "\n",
    "reduced_vect_data = nmf.transform(vect_data)\n",
    "\n",
    "# Print dimensionality\n",
    "print(f\"Data has shape {reduced_vect_data.shape} after NMF.\")\n",
    "print(f\"Each data point has shape {reduced_vect_data[0].shape} after NMF.\")\n",
    "\n",
    "# Save feature names for plotting\n",
    "feature_names = vectorizer.get_feature_names_out()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "H3h1UGBuwE-y"
   },
   "source": [
    "Now every data point is being represented by a vector of `n_comp` dimensions rather than the original 500!\n",
    "\n",
    "In this case every component represents a topic and each data point is represented as a combination of those topics. The value for each topic can be interpreted as how strong the relationship between the text and that particular topic is.\n",
    "\n",
    "Check this for the 1st element of the text data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "6win8Aii1mJm"
   },
   "outputs": [],
   "source": [
    "print(f\"Original text:\\n{data[0]}\\n\")\n",
    "\n",
    "print(f\"Representation based on topics:\\n{reduced_vect_data[0]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "AnDi6ub3qz8Z"
   },
   "source": [
    "Looks like this text can be expressed as a combination of the first, fourth and fifth topic. Specially the later two.\n",
    "\n",
    "At this point you might wonder what these topics are. Since we didn't provide labels, these topics arised from the data. To have a sense of what these topics are, plot the top 20 words for each topic:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "0ehHXYtlrVBl"
   },
   "outputs": [],
   "source": [
    "# Define function for plotting top 20 words for each topic\n",
    "def plot_words_for_topics(n_comp, nmf, feature_names):\n",
    "  fig, axes = plt.subplots(((n_comp-1)//5)+1, 5, figsize=(25, 15))\n",
    "  axes = axes.flatten()\n",
    "\n",
    "  for num_topic, topic in enumerate(nmf.components_, start=1):\n",
    "\n",
    "    # Plot only the top 20 words\n",
    "\n",
    "    # Get the top 20 indexes\n",
    "    top_indexes = np.flip(topic.argsort()[-20:])\n",
    "\n",
    "    # Get the corresponding feature name\n",
    "    top_features = [feature_names[i] for i in top_indexes]\n",
    "\n",
    "    # Get the importance of each word\n",
    "    importance = topic[top_indexes]\n",
    "\n",
    "    # Plot a barplot\n",
    "    ax = axes[num_topic-1]\n",
    "    ax.barh(top_features, importance, color=\"green\")\n",
    "    ax.set_title(f\"Topic {num_topic}\", {\"fontsize\": 20})\n",
    "    ax.invert_yaxis()\n",
    "    ax.tick_params(labelsize=15)\n",
    "\n",
    "  plt.tight_layout()\n",
    "  plt.show()\n",
    "\n",
    "# Run the function\n",
    "plot_words_for_topics(n_comp, nmf, feature_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "RhBPfc2erQnj"
   },
   "source": [
    "Let's try to summarize each topic based on the top most common words for each one:\n",
    "\n",
    "- The first topic is hard to describe but seems to be related to people and actions. \n",
    "\n",
    "- The second one is clearly abouth tech stuff.\n",
    "\n",
    "- Third one is about religion.\n",
    "\n",
    "- Fourth one seems to revolve around sports and/or games.\n",
    "\n",
    "- And the fifth one about education and/or information.\n",
    "\n",
    "\n",
    "This makes sense considering the example with the first element of the text data. That text is mostly about cars (sports) and information.\n",
    "\n",
    "Pretty cool, right?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "FLXcVA8ustmN"
   },
   "source": [
    "The following function condenses the previously used code so you can play trying out with different number of components:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "9qeWwPAIsVEn"
   },
   "outputs": [],
   "source": [
    "def try_NMF(n_comp):\n",
    "  nmf = NMF(n_components=n_comp, random_state=42)\n",
    "  nmf.fit(vect_data)\n",
    "  feature_names = vectorizer.get_feature_names_out()\n",
    "  plot_words_for_topics(n_comp, nmf, feature_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Q9tYNerssi0d"
   },
   "outputs": [],
   "source": [
    "# Try different values!\n",
    "try_NMF(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BcdGRKOTxT5E"
   },
   "source": [
    "**Congratulations on finishing this ungraded lab!** Now you should have a clearer understanding of how to implement dimensionality reduction techniques. \n",
    "\n",
    "The great thing about dimensionality reduction algorithms is that aside from making training and predicting faster, they perform some kind of automatic feature engineering by transforming the raw data into more meaningful representations.\n",
    "\n",
    "\n",
    "**Keep it up!**"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "C3_W2_Lab_2_Algorithmic_Dimensionality.ipynb",
   "private_outputs": true,
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
